{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<center><h1>参考答案</h1></center>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 第一章 预备知识\n",
    "\n",
    "### Ex1：利用列表推导式写矩阵乘法"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "M1 = np.random.rand(2,3)\n",
    "M2 = np.random.rand(3,4)\n",
    "res = [[sum([M1[i][k] * M2[k][j] for k in range(M1.shape[1])]) for j in range(M2.shape[1])] for i in range(M1.shape[0])]\n",
    "(np.abs((M1@M2 - res) < 1e-15)).all()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Ex2：更新矩阵"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[1.83333333, 3.66666667, 5.5       ],\n",
       "       [2.46666667, 3.08333333, 3.7       ],\n",
       "       [2.65277778, 3.03174603, 3.41071429]])"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "A = np.arange(1,10).reshape(3,-1)\n",
    "B = A*(1/A).sum(1).reshape(-1,1)\n",
    "B"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Ex3：卡方统计量"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "11.842696601945802"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "np.random.seed(0)\n",
    "A = np.random.randint(10, 20, (8, 5))\n",
    "B = A.sum(0)*A.sum(1).reshape(-1, 1)/A.sum()\n",
    "res = ((A-B)**2/B).sum()\n",
    "res"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Ex4：改进矩阵计算的性能\n",
    "\n",
    "原方法："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "100566"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "np.random.seed(0)\n",
    "m, n, p = 100, 80, 50\n",
    "B = np.random.randint(0, 2, (m, p))\n",
    "U = np.random.randint(0, 2, (p, n))\n",
    "Z = np.random.randint(0, 2, (m, n))\n",
    "def solution(B=B, U=U, Z=Z):\n",
    "    L_res = []\n",
    "    for i in range(m):\n",
    "        for j in range(n):\n",
    "            norm_value = ((B[i]-U[:,j])**2).sum()\n",
    "            L_res.append(norm_value*Z[i][j])\n",
    "    return sum(L_res)\n",
    "solution(B, U, Z)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "改进方法：\n",
    "\n",
    "令$Y_{ij} = \\|B_i-U_j\\|_2^2$，则$\\displaystyle R=\\sum_{i=1}^m\\sum_{j=1}^n Y_{ij}Z_{ij}$，这在`Numpy`中可以用逐元素的乘法后求和实现，因此问题转化为了如何构造`Y`矩阵。\n",
    "\n",
    "$$\n",
    "\\begin{split}Y_{ij} &= \\|B_i-U_j\\|_2^2\\\\\n",
    "&=\\sum_{k=1}^p(B_{ik}-U_{kj})^2\\\\\n",
    "&=\\sum_{k=1}^p B_{ik}^2+\\sum_{k=1}^p U_{kj}^2-2\\sum_{k=1}^p B_{ik}U_{kj}\\\\\\end{split}\n",
    "$$\n",
    "\n",
    "从上式可以看出，第一第二项分别为$B$的行平方和与$U$的列平方和，第三项是两倍的内积。因此，$Y$矩阵可以写为三个部分，第一个部分是$m×n$的全$1$矩阵每行乘以$B$对应行的行平方和，第二个部分是相同大小的全$1$矩阵每列乘以$U$对应列的列平方和，第三个部分恰为$B$矩阵与$U$矩阵乘积的两倍。从而结果如下："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "100566"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "(((B**2).sum(1).reshape(-1,1) + (U**2).sum(0) - 2*B@U)*Z).sum()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "对比它们的性能："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "48.5 ms ± 4.39 ms per loop (mean ± std. dev. of 7 runs, 30 loops each)\n"
     ]
    }
   ],
   "source": [
    "%timeit -n 30 solution(B, U, Z)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1.11 ms ± 47 µs per loop (mean ± std. dev. of 7 runs, 30 loops each)\n"
     ]
    }
   ],
   "source": [
    "%timeit -n 30 ((np.ones((m,n))*(B**2).sum(1).reshape(-1,1) + np.ones((m,n))*(U**2).sum(0) - 2*B@U)*Z).sum()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Ex5：连续整数的最大长度"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "4"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "f = lambda x:np.diff(np.nonzero(np.r_[1,np.diff(x)!=1,1])).max()\n",
    "f([1,2,5,6,7])\n",
    "f([3,2,1,2,3,4,6])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 第二章 pandas基础\n",
    "### Ex1：口袋妖怪数据集\n",
    "#### 1."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.0"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = pd.read_csv('../data/pokemon.csv')\n",
    "(df[['HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed']].sum(1)!=df['Total']).mean()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 2.\n",
    "(a)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Index(['Water', 'Normal', 'Grass'], dtype='object')"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dp_dup = df.drop_duplicates('#', keep='first')\n",
    "dp_dup['Type 1'].nunique()\n",
    "dp_dup['Type 1'].value_counts().index[:3]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "(b)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "143"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "attr_dup = dp_dup.drop_duplicates(['Type 1', 'Type 2'])\n",
    "attr_dup.shape[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "(c)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "170"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "L_full = [i+' '+j if i!=j else i for i in df['Type 1'].unique() for j in df['Type 1'].unique()]\n",
    "L_part = [i+' '+j if not isinstance(j, float) else i for i, j in zip(df['Type 1'], df['Type 2'])]\n",
    "res = set(L_full).difference(set(L_part))\n",
    "len(res) # 太多，不打印了"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 3.\n",
    "(a)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    low\n",
       "1    mid\n",
       "2    mid\n",
       "3    mid\n",
       "4    mid\n",
       "Name: Attack, dtype: object"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df['Attack'].mask(df['Attack']>120, 'high').mask(df['Attack']<50, 'low').mask((50<=df['Attack'])&(df['Attack']<=120), 'mid').head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "(b)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    GRASS\n",
       "1    GRASS\n",
       "2    GRASS\n",
       "3    GRASS\n",
       "4     FIRE\n",
       "Name: Type 1, dtype: object"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df['Type 1'].replace({i:str.upper(i) for i in df['Type 1'].unique()})\n",
    "df['Type 1'].apply(lambda x:str.upper(x)).head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "(c)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>#</th>\n",
       "      <th>Name</th>\n",
       "      <th>Type 1</th>\n",
       "      <th>Type 2</th>\n",
       "      <th>Total</th>\n",
       "      <th>HP</th>\n",
       "      <th>Attack</th>\n",
       "      <th>Defense</th>\n",
       "      <th>Sp. Atk</th>\n",
       "      <th>Sp. Def</th>\n",
       "      <th>Speed</th>\n",
       "      <th>Deviation</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>230</th>\n",
       "      <td>213</td>\n",
       "      <td>Shuckle</td>\n",
       "      <td>Bug</td>\n",
       "      <td>Rock</td>\n",
       "      <td>505</td>\n",
       "      <td>20</td>\n",
       "      <td>10</td>\n",
       "      <td>230</td>\n",
       "      <td>10</td>\n",
       "      <td>230</td>\n",
       "      <td>5</td>\n",
       "      <td>215.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>121</th>\n",
       "      <td>113</td>\n",
       "      <td>Chansey</td>\n",
       "      <td>Normal</td>\n",
       "      <td>NaN</td>\n",
       "      <td>450</td>\n",
       "      <td>250</td>\n",
       "      <td>5</td>\n",
       "      <td>5</td>\n",
       "      <td>35</td>\n",
       "      <td>105</td>\n",
       "      <td>50</td>\n",
       "      <td>207.5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>261</th>\n",
       "      <td>242</td>\n",
       "      <td>Blissey</td>\n",
       "      <td>Normal</td>\n",
       "      <td>NaN</td>\n",
       "      <td>540</td>\n",
       "      <td>255</td>\n",
       "      <td>10</td>\n",
       "      <td>10</td>\n",
       "      <td>75</td>\n",
       "      <td>135</td>\n",
       "      <td>55</td>\n",
       "      <td>190.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>333</th>\n",
       "      <td>306</td>\n",
       "      <td>AggronMega Aggron</td>\n",
       "      <td>Steel</td>\n",
       "      <td>NaN</td>\n",
       "      <td>630</td>\n",
       "      <td>70</td>\n",
       "      <td>140</td>\n",
       "      <td>230</td>\n",
       "      <td>60</td>\n",
       "      <td>80</td>\n",
       "      <td>50</td>\n",
       "      <td>155.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>224</th>\n",
       "      <td>208</td>\n",
       "      <td>SteelixMega Steelix</td>\n",
       "      <td>Steel</td>\n",
       "      <td>Ground</td>\n",
       "      <td>610</td>\n",
       "      <td>75</td>\n",
       "      <td>125</td>\n",
       "      <td>230</td>\n",
       "      <td>55</td>\n",
       "      <td>95</td>\n",
       "      <td>30</td>\n",
       "      <td>145.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       #                 Name  Type 1  Type 2  Total   HP  Attack  Defense  \\\n",
       "230  213              Shuckle     Bug    Rock    505   20      10      230   \n",
       "121  113              Chansey  Normal     NaN    450  250       5        5   \n",
       "261  242              Blissey  Normal     NaN    540  255      10       10   \n",
       "333  306    AggronMega Aggron   Steel     NaN    630   70     140      230   \n",
       "224  208  SteelixMega Steelix   Steel  Ground    610   75     125      230   \n",
       "\n",
       "     Sp. Atk  Sp. Def  Speed  Deviation  \n",
       "230       10      230      5      215.0  \n",
       "121       35      105     50      207.5  \n",
       "261       75      135     55      190.0  \n",
       "333       60       80     50      155.0  \n",
       "224       55       95     30      145.0  "
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df['Deviation'] = df[['HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed']].apply(lambda x:np.max((x-x.median()).abs()), 1)\n",
    "df.sort_values('Deviation', ascending=False).head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Ex2：指数加权窗口\n",
    "#### 1."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0   -1.000000\n",
       "1   -1.000000\n",
       "2   -1.409836\n",
       "3   -1.609756\n",
       "4   -1.725845\n",
       "dtype: float64"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "np.random.seed(0)\n",
    "s = pd.Series(np.random.randint(-1,2,30).cumsum())\n",
    "s.ewm(alpha=0.2).mean().head()\n",
    "def ewm_func(x, alpha=0.2):\n",
    "    win = (1-alpha)**np.arange(x.shape[0])[::-1]\n",
    "    res = (win*x).sum()/win.sum()\n",
    "    return res\n",
    "s.expanding().apply(ewm_func).head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 2.\n",
    "\n",
    "新的权重为$w_i = (1 - \\alpha)^i, i\\in \\{0,1,...,n-1\\}$，$y_t$更新如下：\n",
    "$$\n",
    "\\begin{split}y_t &=\\frac{\\sum_{i=0}^{n-1} w_i x_{t-i}}{\\sum_{i=0}^{n-1} w_i} \\\\\n",
    "&=\\frac{x_t + (1 - \\alpha)x_{t-1} + (1 - \\alpha)^2 x_{t-2} + ...\n",
    "+ (1 - \\alpha)^{n-1} x_{t-(n-1)}}{1 + (1 - \\alpha) + (1 - \\alpha)^2 + ...\n",
    "+ (1 - \\alpha)^{n-1}}\\\\\\end{split}\n",
    "$$\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0         NaN\n",
       "1         NaN\n",
       "2         NaN\n",
       "3   -1.609756\n",
       "4   -1.826558\n",
       "dtype: float64"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "s.rolling(window=4).apply(ewm_func).head() # 无需对原函数改动"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 第三章 索引\n",
    "### Ex1：公司员工数据集\n",
    "#### 1."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>EmployeeID</th>\n",
       "      <th>birthdate_key</th>\n",
       "      <th>age</th>\n",
       "      <th>city_name</th>\n",
       "      <th>department</th>\n",
       "      <th>job_title</th>\n",
       "      <th>gender</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>3611</th>\n",
       "      <td>5791</td>\n",
       "      <td>1/14/1975</td>\n",
       "      <td>40</td>\n",
       "      <td>Kelowna</td>\n",
       "      <td>Dairy</td>\n",
       "      <td>Dairy Person</td>\n",
       "      <td>M</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3613</th>\n",
       "      <td>5793</td>\n",
       "      <td>1/22/1975</td>\n",
       "      <td>40</td>\n",
       "      <td>Richmond</td>\n",
       "      <td>Bakery</td>\n",
       "      <td>Baker</td>\n",
       "      <td>M</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3615</th>\n",
       "      <td>5795</td>\n",
       "      <td>1/30/1975</td>\n",
       "      <td>40</td>\n",
       "      <td>Nanaimo</td>\n",
       "      <td>Dairy</td>\n",
       "      <td>Dairy Person</td>\n",
       "      <td>M</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      EmployeeID birthdate_key  age city_name department     job_title gender\n",
       "3611        5791     1/14/1975   40   Kelowna      Dairy  Dairy Person      M\n",
       "3613        5793     1/22/1975   40  Richmond     Bakery         Baker      M\n",
       "3615        5795     1/30/1975   40   Nanaimo      Dairy  Dairy Person      M"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = pd.read_csv('../data/company.csv')\n",
    "dpt = ['Dairy', 'Bakery']\n",
    "df.query(\"(age <= 40)&(department == @dpt)&(gender=='M')\").head(3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>EmployeeID</th>\n",
       "      <th>birthdate_key</th>\n",
       "      <th>age</th>\n",
       "      <th>city_name</th>\n",
       "      <th>department</th>\n",
       "      <th>job_title</th>\n",
       "      <th>gender</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>3611</th>\n",
       "      <td>5791</td>\n",
       "      <td>1/14/1975</td>\n",
       "      <td>40</td>\n",
       "      <td>Kelowna</td>\n",
       "      <td>Dairy</td>\n",
       "      <td>Dairy Person</td>\n",
       "      <td>M</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3613</th>\n",
       "      <td>5793</td>\n",
       "      <td>1/22/1975</td>\n",
       "      <td>40</td>\n",
       "      <td>Richmond</td>\n",
       "      <td>Bakery</td>\n",
       "      <td>Baker</td>\n",
       "      <td>M</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3615</th>\n",
       "      <td>5795</td>\n",
       "      <td>1/30/1975</td>\n",
       "      <td>40</td>\n",
       "      <td>Nanaimo</td>\n",
       "      <td>Dairy</td>\n",
       "      <td>Dairy Person</td>\n",
       "      <td>M</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      EmployeeID birthdate_key  age city_name department     job_title gender\n",
       "3611        5791     1/14/1975   40   Kelowna      Dairy  Dairy Person      M\n",
       "3613        5793     1/22/1975   40  Richmond     Bakery         Baker      M\n",
       "3615        5795     1/30/1975   40   Nanaimo      Dairy  Dairy Person      M"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.loc[(df.age<=40)&df.department.isin(dpt)&(df.gender=='M')].head(3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 2."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>EmployeeID</th>\n",
       "      <th>age</th>\n",
       "      <th>job_title</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1319</td>\n",
       "      <td>58</td>\n",
       "      <td>VP Stores</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1321</td>\n",
       "      <td>56</td>\n",
       "      <td>VP Human Resources</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>1323</td>\n",
       "      <td>53</td>\n",
       "      <td>Exec Assistant, VP Stores</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>1325</td>\n",
       "      <td>51</td>\n",
       "      <td>Exec Assistant, Legal Counsel</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>1329</td>\n",
       "      <td>48</td>\n",
       "      <td>Store Manager</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   EmployeeID  age                      job_title\n",
       "1        1319   58                      VP Stores\n",
       "3        1321   56             VP Human Resources\n",
       "5        1323   53      Exec Assistant, VP Stores\n",
       "6        1325   51  Exec Assistant, Legal Counsel\n",
       "8        1329   48                  Store Manager"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.iloc[(df.EmployeeID%2==1).values,[0,2,-2]].head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 3."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_op = df.copy()\n",
    "df_op = df_op.set_index(df_op.columns[-3:].tolist()).swaplevel(0,2,axis=0)\n",
    "df_op = df_op.reset_index(level=1)\n",
    "df_op = df_op.rename_axis(index={'gender':'Gender'})\n",
    "df_op.index = df_op.index.map(lambda x:'_'.join(x))\n",
    "df_op.index = df_op.index.map(lambda x:tuple(x.split('_')))\n",
    "df_op = df_op.rename_axis(index=['gender', 'department'])\n",
    "df_op = df_op.reset_index().reindex(df.columns, axis=1)\n",
    "df_op.equals(df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Ex2：巧克力数据集\n",
    "#### 1."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Company</th>\n",
       "      <th>Review Date</th>\n",
       "      <th>Cocoa Percent</th>\n",
       "      <th>Company Location</th>\n",
       "      <th>Rating</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>A. Morin</td>\n",
       "      <td>2016</td>\n",
       "      <td>63%</td>\n",
       "      <td>France</td>\n",
       "      <td>3.75</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>A. Morin</td>\n",
       "      <td>2015</td>\n",
       "      <td>70%</td>\n",
       "      <td>France</td>\n",
       "      <td>2.75</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>A. Morin</td>\n",
       "      <td>2015</td>\n",
       "      <td>70%</td>\n",
       "      <td>France</td>\n",
       "      <td>3.00</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    Company  Review Date Cocoa Percent Company Location  Rating\n",
       "0  A. Morin         2016           63%           France    3.75\n",
       "1  A. Morin         2015           70%           France    2.75\n",
       "2  A. Morin         2015           70%           France    3.00"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = pd.read_csv('../data/chocolate.csv')\n",
    "df.columns = [' '.join(i.split('\\n')) for i in df.columns]\n",
    "df.head(3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 2."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Company</th>\n",
       "      <th>Review Date</th>\n",
       "      <th>Cocoa Percent</th>\n",
       "      <th>Company Location</th>\n",
       "      <th>Rating</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>33</th>\n",
       "      <td>Akesson's (Pralus)</td>\n",
       "      <td>2010</td>\n",
       "      <td>0.75</td>\n",
       "      <td>Switzerland</td>\n",
       "      <td>2.75</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>34</th>\n",
       "      <td>Akesson's (Pralus)</td>\n",
       "      <td>2010</td>\n",
       "      <td>0.75</td>\n",
       "      <td>Switzerland</td>\n",
       "      <td>2.75</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>36</th>\n",
       "      <td>Alain Ducasse</td>\n",
       "      <td>2014</td>\n",
       "      <td>0.75</td>\n",
       "      <td>France</td>\n",
       "      <td>2.75</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "               Company  Review Date  Cocoa Percent Company Location  Rating\n",
       "33  Akesson's (Pralus)         2010           0.75      Switzerland    2.75\n",
       "34  Akesson's (Pralus)         2010           0.75      Switzerland    2.75\n",
       "36       Alain Ducasse         2014           0.75           France    2.75"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df['Cocoa Percent'] = df['Cocoa Percent'].apply(lambda x:float(x[:-1])/100)\n",
    "df.query('(Rating<3)&(`Cocoa Percent`>`Cocoa Percent`.median())').head(3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 3."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th>Company</th>\n",
       "      <th>Cocoa Percent</th>\n",
       "      <th>Rating</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Review Date</th>\n",
       "      <th>Company Location</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th rowspan=\"3\" valign=\"top\">2012</th>\n",
       "      <th>Australia</th>\n",
       "      <td>Bahen &amp; Co.</td>\n",
       "      <td>0.7</td>\n",
       "      <td>3.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Australia</th>\n",
       "      <td>Bahen &amp; Co.</td>\n",
       "      <td>0.7</td>\n",
       "      <td>2.5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Australia</th>\n",
       "      <td>Bahen &amp; Co.</td>\n",
       "      <td>0.7</td>\n",
       "      <td>2.5</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                  Company  Cocoa Percent  Rating\n",
       "Review Date Company Location                                    \n",
       "2012        Australia         Bahen & Co.            0.7     3.0\n",
       "            Australia         Bahen & Co.            0.7     2.5\n",
       "            Australia         Bahen & Co.            0.7     2.5"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "idx = pd.IndexSlice\n",
    "exclude = ['France', 'Canada', 'Amsterdam', 'Belgium']\n",
    "res = df.set_index(['Review Date', 'Company Location']).sort_index(level=0)\n",
    "res.loc[idx[2012:,~res.index.get_level_values(1).isin(exclude)],:].head(3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 第四章 分组\n",
    "### Ex1：汽车数据集\n",
    "#### 1."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>CoV</th>\n",
       "      <th>mean</th>\n",
       "      <th>count</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Country</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>Japan</th>\n",
       "      <td>0.387429</td>\n",
       "      <td>13938.052632</td>\n",
       "      <td>19</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Japan/USA</th>\n",
       "      <td>0.240040</td>\n",
       "      <td>10067.571429</td>\n",
       "      <td>7</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Korea</th>\n",
       "      <td>0.243435</td>\n",
       "      <td>7857.333333</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>USA</th>\n",
       "      <td>0.203344</td>\n",
       "      <td>12543.269231</td>\n",
       "      <td>26</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                CoV          mean  count\n",
       "Country                                 \n",
       "Japan      0.387429  13938.052632     19\n",
       "Japan/USA  0.240040  10067.571429      7\n",
       "Korea      0.243435   7857.333333      3\n",
       "USA        0.203344  12543.269231     26"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = pd.read_csv('../data/car.csv')\n",
    "df.groupby('Country').filter(lambda x:x.shape[0]>2).groupby('Country')['Price'].agg([('CoV', lambda x: x.std()/x.mean()), 'mean', 'count'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 2."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "60"
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.shape[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Head     9069.95\n",
       "Mid     13356.40\n",
       "Tail    15420.65\n",
       "Name: Price, dtype: float64"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "condition = ['Head']*20+['Mid']*20+['Tail']*20\n",
    "df.groupby(condition)['Price'].mean()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 3."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Price_max</th>\n",
       "      <th>HP_min</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Type</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>Compact</th>\n",
       "      <td>18900</td>\n",
       "      <td>95</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Large</th>\n",
       "      <td>17257</td>\n",
       "      <td>150</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Medium</th>\n",
       "      <td>24760</td>\n",
       "      <td>110</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Small</th>\n",
       "      <td>9995</td>\n",
       "      <td>63</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Sporty</th>\n",
       "      <td>13945</td>\n",
       "      <td>92</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Van</th>\n",
       "      <td>15395</td>\n",
       "      <td>106</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         Price_max  HP_min\n",
       "Type                      \n",
       "Compact      18900      95\n",
       "Large        17257     150\n",
       "Medium       24760     110\n",
       "Small         9995      63\n",
       "Sporty       13945      92\n",
       "Van          15395     106"
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "res = df.groupby('Type').agg({'Price': ['max'], 'HP': ['min']})\n",
    "res.columns = res.columns.map(lambda x:'_'.join(x))\n",
    "res"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 4."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    1.00\n",
       "1    0.54\n",
       "2    0.00\n",
       "3    0.58\n",
       "4    0.80\n",
       "Name: HP, dtype: float64"
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "def normalize(s):\n",
    "    s_min, s_max = s.min(), s.max()\n",
    "    res = (s - s_min)/(s_max - s_min)\n",
    "    return res\n",
    "df.groupby('Type')['HP'].transform(normalize).head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 5."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Type\n",
       "Compact    0.586087\n",
       "Large     -0.242765\n",
       "Medium     0.370491\n",
       "Small      0.603916\n",
       "Sporty     0.871426\n",
       "Van        0.819881\n",
       "dtype: float64"
      ]
     },
     "execution_count": 31,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.groupby('Type')[['HP', 'Disp.']].apply(lambda x:np.corrcoef(x['HP'].values, x['Disp.'].values)[0,1])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Ex2：实现transform函数"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<__main__.my_groupby at 0x1817feca9a0>"
      ]
     },
     "execution_count": 32,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "class my_groupby:\n",
    "    def __init__(self, my_df, group_cols):\n",
    "        self.my_df = my_df.copy()\n",
    "        self.groups = my_df[group_cols].drop_duplicates()\n",
    "        if isinstance(self.groups, pd.Series):\n",
    "            self.groups = self.groups.to_frame()\n",
    "        self.group_cols = self.groups.columns.tolist()\n",
    "        self.groups = {i: self.groups[i].values.tolist() for i in self.groups.columns}\n",
    "        self.transform_col = None\n",
    "    def __getitem__(self, col):\n",
    "        self.pr_col = [col] if isinstance(col, str) else list(col)\n",
    "        return self\n",
    "    def transform(self, my_func):\n",
    "        self.num = len(self.groups[self.group_cols[0]])\n",
    "        L_order, L_value = np.array([]), np.array([])\n",
    "        for i in range(self.num):\n",
    "            group_df = self.my_df.reset_index().copy()\n",
    "            for col in self.group_cols:\n",
    "                group_df = group_df[group_df[col]==self.groups[col][i]]\n",
    "            group_df = group_df[self.pr_col]\n",
    "            if group_df.shape[1] == 1:\n",
    "                group_df = group_df.iloc[:, 0]\n",
    "            group_res = my_func(group_df)\n",
    "            if not isinstance(group_res, pd.Series):\n",
    "                group_res = pd.Series(group_res,index=group_df.index,name=group_df.name)\n",
    "            L_order = np.r_[L_order, group_res.index]\n",
    "            L_value = np.r_[L_value, group_res.values]\n",
    "        self.res = pd.Series(pd.Series(L_value, index=L_order).sort_index().values,index=self.my_df.reset_index().index, name=my_func.__name__)\n",
    "        return self.res\n",
    "\n",
    "my_groupby(df, 'Type')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "单列分组："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    0.733592\n",
       "1    0.372003\n",
       "2    0.109712\n",
       "3    0.186244\n",
       "4    0.177525\n",
       "Name: f, dtype: float64"
      ]
     },
     "execution_count": 33,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "def f(s):\n",
    "    res = (s-s.min())/(s.max()-s.min())\n",
    "    return res\n",
    "my_groupby(df, 'Type')['Price'].transform(f).head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    0.733592\n",
       "1    0.372003\n",
       "2    0.109712\n",
       "3    0.186244\n",
       "4    0.177525\n",
       "Name: Price, dtype: float64"
      ]
     },
     "execution_count": 34,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.groupby('Type')['Price'].transform(f).head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "多列分组："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    1.000000\n",
       "1    0.000000\n",
       "2    0.000000\n",
       "3    0.000000\n",
       "4    0.196357\n",
       "Name: f, dtype: float64"
      ]
     },
     "execution_count": 35,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "my_groupby(df, ['Type','Country'])['Price'].transform(f).head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    1.000000\n",
       "1    0.000000\n",
       "2    0.000000\n",
       "3    0.000000\n",
       "4    0.196357\n",
       "Name: Price, dtype: float64"
      ]
     },
     "execution_count": 36,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.groupby(['Type','Country'])['Price'].transform(f).head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "标量广播："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    7682.384615\n",
       "1    7682.384615\n",
       "2    7682.384615\n",
       "3    7682.384615\n",
       "4    7682.384615\n",
       "Name: <lambda>, dtype: float64"
      ]
     },
     "execution_count": 37,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "my_groupby(df, 'Type')['Price'].transform(lambda x:x.mean()).head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    7682.384615\n",
       "1    7682.384615\n",
       "2    7682.384615\n",
       "3    7682.384615\n",
       "4    7682.384615\n",
       "Name: Price, dtype: float64"
      ]
     },
     "execution_count": 38,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.groupby('Type')['Price'].transform(lambda x:x.mean()).head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "跨列计算："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    0.858407\n",
       "1    1.266667\n",
       "2    1.285714\n",
       "3    0.989130\n",
       "4    1.097087\n",
       "Name: <lambda>, dtype: float64"
      ]
     },
     "execution_count": 39,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "my_groupby(df, 'Type')['Disp.', 'HP'].transform(lambda x: x['Disp.']/x.HP).head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 第五章 变形\n",
    "### Ex1：美国非法药物数据集\n",
    "#### 1."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>State</th>\n",
       "      <th>COUNTY</th>\n",
       "      <th>SubstanceName</th>\n",
       "      <th>2010</th>\n",
       "      <th>2011</th>\n",
       "      <th>2012</th>\n",
       "      <th>2013</th>\n",
       "      <th>2014</th>\n",
       "      <th>2015</th>\n",
       "      <th>2016</th>\n",
       "      <th>2017</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>KY</td>\n",
       "      <td>ADAIR</td>\n",
       "      <td>Buprenorphine</td>\n",
       "      <td>NaN</td>\n",
       "      <td>3.0</td>\n",
       "      <td>5.0</td>\n",
       "      <td>4.0</td>\n",
       "      <td>27.0</td>\n",
       "      <td>5.0</td>\n",
       "      <td>7.0</td>\n",
       "      <td>10.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>KY</td>\n",
       "      <td>ADAIR</td>\n",
       "      <td>Codeine</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>KY</td>\n",
       "      <td>ADAIR</td>\n",
       "      <td>Fentanyl</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>KY</td>\n",
       "      <td>ADAIR</td>\n",
       "      <td>Heroin</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1.0</td>\n",
       "      <td>2.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>KY</td>\n",
       "      <td>ADAIR</td>\n",
       "      <td>Hydrocodone</td>\n",
       "      <td>6.0</td>\n",
       "      <td>9.0</td>\n",
       "      <td>10.0</td>\n",
       "      <td>10.0</td>\n",
       "      <td>9.0</td>\n",
       "      <td>7.0</td>\n",
       "      <td>11.0</td>\n",
       "      <td>3.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  State COUNTY  SubstanceName  2010  2011  2012  2013  2014  2015  2016  2017\n",
       "0    KY  ADAIR  Buprenorphine   NaN   3.0   5.0   4.0  27.0   5.0   7.0  10.0\n",
       "1    KY  ADAIR        Codeine   NaN   NaN   1.0   NaN   NaN   NaN   NaN   1.0\n",
       "2    KY  ADAIR       Fentanyl   NaN   NaN   1.0   NaN   NaN   NaN   NaN   NaN\n",
       "3    KY  ADAIR         Heroin   NaN   NaN   1.0   2.0   NaN   1.0   NaN   2.0\n",
       "4    KY  ADAIR    Hydrocodone   6.0   9.0  10.0  10.0   9.0   7.0  11.0   3.0"
      ]
     },
     "execution_count": 40,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = pd.read_csv('../data/drugs.csv').sort_values(['State','COUNTY','SubstanceName'],ignore_index=True)\n",
    "res = df.pivot(index=['State','COUNTY','SubstanceName'], columns='YYYY', values='DrugReports').reset_index().rename_axis(columns={'YYYY':''})\n",
    "res.head(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 2."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 41,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "res_melted = res.melt(id_vars = ['State','COUNTY','SubstanceName'],\n",
    "                      value_vars = res.columns[-8:],\n",
    "                      var_name = 'YYYY',\n",
    "                      value_name = 'DrugReports').dropna(\n",
    "                      subset=['DrugReports'])\n",
    "res_melted = res_melted[df.columns].sort_values(['State','COUNTY','SubstanceName'],ignore_index=True).astype({'YYYY':'int64', 'DrugReports':'int64'})\n",
    "res_melted.equals(df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 3.\n",
    "策略一："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th>State</th>\n",
       "      <th>KY</th>\n",
       "      <th>OH</th>\n",
       "      <th>PA</th>\n",
       "      <th>VA</th>\n",
       "      <th>WV</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>YYYY</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>2010</th>\n",
       "      <td>10453</td>\n",
       "      <td>19707</td>\n",
       "      <td>19814</td>\n",
       "      <td>8685</td>\n",
       "      <td>2890</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2011</th>\n",
       "      <td>10289</td>\n",
       "      <td>20330</td>\n",
       "      <td>19987</td>\n",
       "      <td>6749</td>\n",
       "      <td>3271</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2012</th>\n",
       "      <td>10722</td>\n",
       "      <td>23145</td>\n",
       "      <td>19959</td>\n",
       "      <td>7831</td>\n",
       "      <td>3376</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "State     KY     OH     PA    VA    WV\n",
       "YYYY                                  \n",
       "2010   10453  19707  19814  8685  2890\n",
       "2011   10289  20330  19987  6749  3271\n",
       "2012   10722  23145  19959  7831  3376"
      ]
     },
     "execution_count": 42,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "res = df.pivot_table(index='YYYY', columns='State', values='DrugReports', aggfunc='sum')\n",
    "res.head(3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "策略二："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th>State</th>\n",
       "      <th>KY</th>\n",
       "      <th>OH</th>\n",
       "      <th>PA</th>\n",
       "      <th>VA</th>\n",
       "      <th>WV</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>YYYY</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>2010</th>\n",
       "      <td>10453</td>\n",
       "      <td>19707</td>\n",
       "      <td>19814</td>\n",
       "      <td>8685</td>\n",
       "      <td>2890</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2011</th>\n",
       "      <td>10289</td>\n",
       "      <td>20330</td>\n",
       "      <td>19987</td>\n",
       "      <td>6749</td>\n",
       "      <td>3271</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2012</th>\n",
       "      <td>10722</td>\n",
       "      <td>23145</td>\n",
       "      <td>19959</td>\n",
       "      <td>7831</td>\n",
       "      <td>3376</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "State     KY     OH     PA    VA    WV\n",
       "YYYY                                  \n",
       "2010   10453  19707  19814  8685  2890\n",
       "2011   10289  20330  19987  6749  3271\n",
       "2012   10722  23145  19959  7831  3376"
      ]
     },
     "execution_count": 43,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "res = df.groupby(['State', 'YYYY'])['DrugReports'].sum().to_frame().unstack(0).droplevel(0,axis=1)\n",
    "res.head(3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Ex2：特殊的wide_to_long方法"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Class</th>\n",
       "      <th>Name</th>\n",
       "      <th>Chinese</th>\n",
       "      <th>Math</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>San Zhang</td>\n",
       "      <td>80</td>\n",
       "      <td>80</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2</td>\n",
       "      <td>Si Li</td>\n",
       "      <td>90</td>\n",
       "      <td>75</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   Class       Name  Chinese  Math\n",
       "0      1  San Zhang       80    80\n",
       "1      2      Si Li       90    75"
      ]
     },
     "execution_count": 44,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = pd.DataFrame({'Class':[1,2],\n",
    "                   'Name':['San Zhang', 'Si Li'],\n",
    "                   'Chinese':[80, 90],\n",
    "                   'Math':[80, 75]})\n",
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Class</th>\n",
       "      <th>Name</th>\n",
       "      <th>Subject</th>\n",
       "      <th>Grade</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>San Zhang</td>\n",
       "      <td>Chinese</td>\n",
       "      <td>80</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>San Zhang</td>\n",
       "      <td>Math</td>\n",
       "      <td>80</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2</td>\n",
       "      <td>Si Li</td>\n",
       "      <td>Chinese</td>\n",
       "      <td>90</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>2</td>\n",
       "      <td>Si Li</td>\n",
       "      <td>Math</td>\n",
       "      <td>75</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   Class       Name  Subject  Grade\n",
       "0      1  San Zhang  Chinese     80\n",
       "1      1  San Zhang     Math     80\n",
       "2      2      Si Li  Chinese     90\n",
       "3      2      Si Li     Math     75"
      ]
     },
     "execution_count": 45,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = df.rename(columns={'Chinese':'pre_Chinese', 'Math':'pre_Math'})\n",
    "pd.wide_to_long(df,\n",
    "                stubnames=['pre'],\n",
    "                i = ['Class', 'Name'],\n",
    "                j='Subject',\n",
    "                sep='_',\n",
    "                suffix='.+').reset_index().rename(columns={'pre':'Grade'})"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 第六章 连接\n",
    "### Ex1：美国疫情数据集"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {},
   "outputs": [],
   "source": [
    "date = pd.date_range('20200412', '20201116').to_series()\n",
    "date = date.dt.month.astype('string').str.zfill(2) +'-'+ date.dt.day.astype('string').str.zfill(2) +'-'+ '2020'\n",
    "date = date.tolist()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Confirmed</th>\n",
       "      <th>Deaths</th>\n",
       "      <th>Recovered</th>\n",
       "      <th>Active</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>04-12-2020</th>\n",
       "      <td>189033</td>\n",
       "      <td>9385</td>\n",
       "      <td>23887.0</td>\n",
       "      <td>179648</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>04-13-2020</th>\n",
       "      <td>195749</td>\n",
       "      <td>10058</td>\n",
       "      <td>23887.0</td>\n",
       "      <td>185691.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>04-14-2020</th>\n",
       "      <td>203020</td>\n",
       "      <td>10842</td>\n",
       "      <td>23887.0</td>\n",
       "      <td>192178.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>04-15-2020</th>\n",
       "      <td>214454</td>\n",
       "      <td>11617</td>\n",
       "      <td>23887.0</td>\n",
       "      <td>202837.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>04-16-2020</th>\n",
       "      <td>223691</td>\n",
       "      <td>14832</td>\n",
       "      <td>23887.0</td>\n",
       "      <td>208859.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "           Confirmed Deaths Recovered    Active\n",
       "04-12-2020    189033   9385   23887.0    179648\n",
       "04-13-2020    195749  10058   23887.0  185691.0\n",
       "04-14-2020    203020  10842   23887.0  192178.0\n",
       "04-15-2020    214454  11617   23887.0  202837.0\n",
       "04-16-2020    223691  14832   23887.0  208859.0"
      ]
     },
     "execution_count": 47,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "L = []\n",
    "for d in date:\n",
    "    df = pd.read_csv('../data/us_report/' + d + '.csv', index_col='Province_State')\n",
    "    data = df.loc['New York', ['Confirmed','Deaths','Recovered','Active']]\n",
    "    L.append(data.to_frame().T)\n",
    "res = pd.concat(L)\n",
    "res.index = date\n",
    "res.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Ex2：实现join函数"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "metadata": {},
   "outputs": [],
   "source": [
    "def join(df1, df2, how='left'):\n",
    "    res_col = df1.columns.tolist() +  df2.columns.tolist()\n",
    "    dup = df1.index.unique().intersection(df2.index.unique())\n",
    "    res_df = pd.DataFrame(columns = res_col)\n",
    "    for label in dup:\n",
    "        cartesian = [list(i)+list(j) for i in df1.loc[label].values.reshape(-1,1) for j in df2.loc[label].values.reshape(-1,1)]\n",
    "        dup_df = pd.DataFrame(cartesian, index = [label]*len(cartesian), columns = res_col)\n",
    "        res_df = pd.concat([res_df,dup_df])\n",
    "    if how in ['left', 'outer']:\n",
    "        for label in df1.index.unique().difference(dup):\n",
    "            if isinstance(df1.loc[label], pd.DataFrame):\n",
    "                cat = [list(i)+[np.nan]*df2.shape[1] for i in df1.loc[label].values]\n",
    "            else:\n",
    "                cat = [list(i)+[np.nan]*df2.shape[1] for i in df1.loc[label].to_frame().values]\n",
    "            dup_df = pd.DataFrame(cat, index = [label]*len(cat), columns = res_col)\n",
    "            res_df = pd.concat([res_df,dup_df])\n",
    "    if how in ['right', 'outer']:\n",
    "        for label in df2.index.unique().difference(dup):\n",
    "            if isinstance(df2.loc[label], pd.DataFrame):\n",
    "                cat = [[np.nan]+list(i)*df1.shape[1] for i in df2.loc[label].values]\n",
    "            else:\n",
    "                cat = [[np.nan]+list(i)*df1.shape[1] for i in df2.loc[label].to_frame().values]\n",
    "            dup_df = pd.DataFrame(cat, index = [label]*len(cat), columns = res_col)\n",
    "            res_df = pd.concat([res_df,dup_df])\n",
    "    return res_df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>col1</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>A</th>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>A</th>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>B</th>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C</th>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>D</th>\n",
       "      <td>5</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   col1\n",
       "A     1\n",
       "A     2\n",
       "B     3\n",
       "C     4\n",
       "D     5"
      ]
     },
     "execution_count": 49,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df1 = pd.DataFrame({'col1':[1,2,3,4,5]}, index=list('AABCD'))\n",
    "df1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>col2</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>A</th>\n",
       "      <td>o</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>B</th>\n",
       "      <td>p</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>B</th>\n",
       "      <td>q</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C</th>\n",
       "      <td>r</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>E</th>\n",
       "      <td>s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>E</th>\n",
       "      <td>t</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  col2\n",
       "A    o\n",
       "B    p\n",
       "B    q\n",
       "C    r\n",
       "E    s\n",
       "E    t"
      ]
     },
     "execution_count": 50,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df2 = pd.DataFrame({'col2':list('opqrst')}, index=list('ABBCEE'))\n",
    "df2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>col1</th>\n",
       "      <th>col2</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>A</th>\n",
       "      <td>1</td>\n",
       "      <td>o</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>A</th>\n",
       "      <td>2</td>\n",
       "      <td>o</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>B</th>\n",
       "      <td>3</td>\n",
       "      <td>p</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>B</th>\n",
       "      <td>3</td>\n",
       "      <td>q</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>C</th>\n",
       "      <td>4</td>\n",
       "      <td>r</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  col1 col2\n",
       "A    1    o\n",
       "A    2    o\n",
       "B    3    p\n",
       "B    3    q\n",
       "C    4    r"
      ]
     },
     "execution_count": 51,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "join(df1, df2, how='inner')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 第七章 缺失数据\n",
    "### Ex1：缺失值与类别的相关性检验"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.9712760884395901"
      ]
     },
     "execution_count": 52,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = pd.read_csv('../data/missing_chi.csv')\n",
    "cat_1 = df.X_1.fillna('NaN').mask(df.X_1.notna()).fillna(\"NotNaN\")\n",
    "cat_2 = df.X_2.fillna('NaN').mask(df.X_2.notna()).fillna(\"NotNaN\")\n",
    "df_1 = pd.crosstab(cat_1, df.y, margins=True)\n",
    "df_2 = pd.crosstab(cat_2, df.y, margins=True)\n",
    "def compute_S(my_df):\n",
    "    S = []\n",
    "    for i in range(2):\n",
    "        for j in range(2):\n",
    "            E = my_df.iat[i, j]\n",
    "            F = my_df.iat[i, 2]*my_df.iat[2, j]/my_df.iat[2,2]\n",
    "            S.append((E-F)**2/F)\n",
    "    return sum(S)\n",
    "res1 = compute_S(df_1)\n",
    "res2 = compute_S(df_2)\n",
    "from scipy.stats import chi2\n",
    "chi2.sf(res1, 1) # X_1检验的p值 # 不能认为相关，剔除"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 53,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "7.459641265637543e-166"
      ]
     },
     "execution_count": 53,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "chi2.sf(res2, 1) # X_2检验的p值 # 认为相关，保留"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "结果与`scipy.stats.chi2_contingency`在不使用$Yates$修正的情况下完全一致："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "7.459641265637543e-166"
      ]
     },
     "execution_count": 54,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from scipy.stats import chi2_contingency\n",
    "chi2_contingency(pd.crosstab(cat_1, df.y), correction=False)[1]\n",
    "chi2_contingency(pd.crosstab(cat_2, df.y), correction=False)[1]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Ex2：用回归模型解决分类问题\n",
    "#### 1."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "c:\\Users\\gyh\\miniconda3\\envs\\jp\\lib\\site-packages\\sklearn\\base.py:450: UserWarning: X does not have valid feature names, but KNeighborsRegressor was fitted with feature names\n",
      "  warnings.warn(\n",
      "c:\\Users\\gyh\\miniconda3\\envs\\jp\\lib\\site-packages\\sklearn\\base.py:450: UserWarning: X does not have valid feature names, but KNeighborsRegressor was fitted with feature names\n",
      "  warnings.warn(\n",
      "c:\\Users\\gyh\\miniconda3\\envs\\jp\\lib\\site-packages\\sklearn\\base.py:450: UserWarning: X does not have valid feature names, but KNeighborsRegressor was fitted with feature names\n",
      "  warnings.warn(\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "'Yellow'"
      ]
     },
     "execution_count": 55,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.neighbors import KNeighborsRegressor\n",
    "df = pd.read_excel('../data/color.xlsx')\n",
    "df_dummies = pd.get_dummies(df.Color)\n",
    "stack_list = []\n",
    "for col in df_dummies.columns:\n",
    "    clf = KNeighborsRegressor(n_neighbors=6)\n",
    "    clf.fit(df.iloc[:,:2], df_dummies[col])\n",
    "    res = clf.predict([[0.8, -0.2]]).reshape(-1,1)\n",
    "    stack_list.append(res)\n",
    "code_res = pd.Series(np.hstack(stack_list).argmax(1))\n",
    "df_dummies.columns[code_res[0]]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 2."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 56,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "ID            0\n",
       "Age           0\n",
       "Employment    0\n",
       "Marital       0\n",
       "Income        0\n",
       "Gender        0\n",
       "Hours         0\n",
       "dtype: int64"
      ]
     },
     "execution_count": 56,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.neighbors import KNeighborsRegressor\n",
    "df = pd.read_csv('../data/audit.csv')\n",
    "res_df = df.copy()\n",
    "df = pd.concat([pd.get_dummies(df[['Marital', 'Gender']]), df[['Age','Income','Hours']].apply(lambda x:(x-x.min())/(x.max()-x.min())), df.Employment],1)\n",
    "X_train = df.query('Employment.notna()')\n",
    "X_test = df.query('Employment.isna()')\n",
    "df_dummies = pd.get_dummies(X_train.Employment)\n",
    "stack_list = []\n",
    "for col in df_dummies.columns:\n",
    "    clf = KNeighborsRegressor(n_neighbors=6)\n",
    "    clf.fit(X_train.iloc[:,:-1], df_dummies[col])\n",
    "    res = clf.predict(X_test.iloc[:,:-1]).reshape(-1,1)\n",
    "    stack_list.append(res)\n",
    "code_res = pd.Series(np.hstack(stack_list).argmax(1))\n",
    "cat_res = code_res.replace(dict(zip(list(range(df_dummies.shape[0])),df_dummies.columns)))\n",
    "res_df.loc[res_df.Employment.isna(), 'Employment'] = cat_res.values\n",
    "res_df.isna().sum()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 第八章 文本数据\n",
    "### Ex1：房屋信息数据集\n",
    "#### 1."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>floor</th>\n",
       "      <th>year</th>\n",
       "      <th>area</th>\n",
       "      <th>price</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>高层（共6层）</td>\n",
       "      <td>1986</td>\n",
       "      <td>58.23㎡</td>\n",
       "      <td>155万</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>中层（共20层）</td>\n",
       "      <td>2020</td>\n",
       "      <td>88㎡</td>\n",
       "      <td>155万</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>低层（共28层）</td>\n",
       "      <td>2010</td>\n",
       "      <td>89.33㎡</td>\n",
       "      <td>365万</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      floor  year    area price\n",
       "0   高层（共6层）  1986  58.23㎡  155万\n",
       "1  中层（共20层）  2020     88㎡  155万\n",
       "2  低层（共28层）  2010  89.33㎡  365万"
      ]
     },
     "execution_count": 57,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = pd.read_excel('../data/house_info.xls', usecols=['floor','year','area','price'])\n",
    "df.year = pd.to_numeric(df.year.str[:-2]).astype('Int64')\n",
    "df.head(3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 2."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 58,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>year</th>\n",
       "      <th>area</th>\n",
       "      <th>price</th>\n",
       "      <th>Level</th>\n",
       "      <th>Highest</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1986</td>\n",
       "      <td>58.23㎡</td>\n",
       "      <td>155万</td>\n",
       "      <td>高层</td>\n",
       "      <td>6</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2020</td>\n",
       "      <td>88㎡</td>\n",
       "      <td>155万</td>\n",
       "      <td>中层</td>\n",
       "      <td>20</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2010</td>\n",
       "      <td>89.33㎡</td>\n",
       "      <td>365万</td>\n",
       "      <td>低层</td>\n",
       "      <td>28</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   year    area price Level Highest\n",
       "0  1986  58.23㎡  155万    高层       6\n",
       "1  2020     88㎡  155万    中层      20\n",
       "2  2010  89.33㎡  365万    低层      28"
      ]
     },
     "execution_count": 58,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pat = '(\\w层)（共(\\d+)层）'\n",
    "new_cols = df.floor.str.extract(pat).rename(columns={0:'Level', 1:'Highest'})\n",
    "df = pd.concat([df.drop(columns=['floor']), new_cols], 1)\n",
    "df.head(3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 3."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 59,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>year</th>\n",
       "      <th>area</th>\n",
       "      <th>price</th>\n",
       "      <th>Level</th>\n",
       "      <th>Highest</th>\n",
       "      <th>avg_price</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1986</td>\n",
       "      <td>58.23㎡</td>\n",
       "      <td>155万</td>\n",
       "      <td>高层</td>\n",
       "      <td>6</td>\n",
       "      <td>26618元/平米</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2020</td>\n",
       "      <td>88㎡</td>\n",
       "      <td>155万</td>\n",
       "      <td>中层</td>\n",
       "      <td>20</td>\n",
       "      <td>17613元/平米</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2010</td>\n",
       "      <td>89.33㎡</td>\n",
       "      <td>365万</td>\n",
       "      <td>低层</td>\n",
       "      <td>28</td>\n",
       "      <td>40859元/平米</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   year    area price Level Highest  avg_price\n",
       "0  1986  58.23㎡  155万    高层       6  26618元/平米\n",
       "1  2020     88㎡  155万    中层      20  17613元/平米\n",
       "2  2010  89.33㎡  365万    低层      28  40859元/平米"
      ]
     },
     "execution_count": 59,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "s_area = pd.to_numeric(df.area.str[:-1])\n",
    "s_price = pd.to_numeric(df.price.str[:-1])\n",
    "df['avg_price'] = ((s_price/s_area)*10000).astype('int').astype('string') + '元/平米'\n",
    "df.head(3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Ex2：《权力的游戏》剧本数据集\n",
    "#### 1."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 60,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Season    Episode   \n",
       "Season 1  Episode 1     327\n",
       "          Episode 10    266\n",
       "          Episode 2     283\n",
       "          Episode 3     353\n",
       "          Episode 4     404\n",
       "Name: Sentence, dtype: int64"
      ]
     },
     "execution_count": 60,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = pd.read_csv('../data/script.csv')\n",
    "df.columns = df.columns.str.strip()\n",
    "df.groupby(['Season', 'Episode'])['Sentence'].count().head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 2."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 61,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Name\n",
       "male singer          109.000000\n",
       "slave owner           77.000000\n",
       "manderly              62.000000\n",
       "lollys stokeworth     62.000000\n",
       "dothraki matron       56.666667\n",
       "Name: Sentence, dtype: float64"
      ]
     },
     "execution_count": 61,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.set_index('Name').Sentence.str.split().str.len().groupby('Name').mean().sort_values(ascending=False).head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 3."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 62,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Name\n",
       "tyrion lannister    527\n",
       "jon snow            374\n",
       "jaime lannister     283\n",
       "arya stark          265\n",
       "cersei lannister    246\n",
       "dtype: int64"
      ]
     },
     "execution_count": 62,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "s = pd.Series(df.Sentence.values, index=df.Name.shift(-1))\n",
    "s.str.count('\\?').groupby('Name').sum().sort_values(ascending=False).head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 第九章 分类数据\n",
    "### Ex1：统计未出现的类别"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 63,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th>B</th>\n",
       "      <th>cat</th>\n",
       "      <th>dog</th>\n",
       "      <th>sheep</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>A</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>a</th>\n",
       "      <td>2</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>b</th>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>c</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "B  cat  dog  sheep\n",
       "A                 \n",
       "a    2    0      0\n",
       "b    1    0      0\n",
       "c    0    1      0"
      ]
     },
     "execution_count": 63,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "def my_crosstab(s1, s2, dropna=True):\n",
    "    idx1 = (s1.cat.categories if s1.dtype.name == 'category' and not dropna else s1.unique())\n",
    "    idx2 = (s2.cat.categories if s2.dtype.name == 'category' and not dropna else s2.unique())\n",
    "    res = pd.DataFrame(np.zeros((idx1.shape[0], idx2.shape[0])), index=idx1, columns=idx2)\n",
    "    for i, j in zip(s1, s2):\n",
    "        res.at[i, j] += 1\n",
    "    res = res.rename_axis(index=s1.name, columns=s2.name).astype('int')\n",
    "    return res\n",
    "df = pd.DataFrame({'A':['a','b','c','a'], 'B':['cat','cat','dog','cat']})\n",
    "df.B = df.B.astype('category').cat.add_categories('sheep')\n",
    "my_crosstab(df.A, df.B)\n",
    "my_crosstab(df.A, df.B, dropna=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Ex2：钻石数据集\n",
    "#### 1."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 64,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.read_csv('../data/diamonds.csv')\n",
    "s_obj, s_cat = df.cut, df.cut.astype('category')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 65,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "9.31 ms ± 756 µs per loop (mean ± std. dev. of 7 runs, 30 loops each)\n"
     ]
    }
   ],
   "source": [
    "%timeit -n 30 s_obj.nunique()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 66,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1.97 ms ± 156 µs per loop (mean ± std. dev. of 7 runs, 30 loops each)\n"
     ]
    }
   ],
   "source": [
    "%timeit -n 30 s_cat.nunique()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 2."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 67,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>carat</th>\n",
       "      <th>cut</th>\n",
       "      <th>clarity</th>\n",
       "      <th>price</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>315</th>\n",
       "      <td>0.96</td>\n",
       "      <td>Ideal</td>\n",
       "      <td>I1</td>\n",
       "      <td>2801</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>535</th>\n",
       "      <td>0.96</td>\n",
       "      <td>Ideal</td>\n",
       "      <td>I1</td>\n",
       "      <td>2826</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>551</th>\n",
       "      <td>0.97</td>\n",
       "      <td>Ideal</td>\n",
       "      <td>I1</td>\n",
       "      <td>2830</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     carat    cut clarity  price\n",
       "315   0.96  Ideal      I1   2801\n",
       "535   0.96  Ideal      I1   2826\n",
       "551   0.97  Ideal      I1   2830"
      ]
     },
     "execution_count": 67,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.cut = df.cut.astype('category').cat.reorder_categories(['Fair', 'Good', 'Very Good', 'Premium', 'Ideal'],ordered=True)\n",
    "df.clarity = df.clarity.astype('category').cat.reorder_categories(['I1', 'SI2', 'SI1', 'VS2', 'VS1', 'VVS2', 'VVS1', 'IF'],ordered=True)\n",
    "res = df.sort_values(['cut', 'clarity'], ascending=[False, True])\n",
    "res.head(3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 68,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>carat</th>\n",
       "      <th>cut</th>\n",
       "      <th>clarity</th>\n",
       "      <th>price</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>47407</th>\n",
       "      <td>0.52</td>\n",
       "      <td>Fair</td>\n",
       "      <td>IF</td>\n",
       "      <td>1849</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>49683</th>\n",
       "      <td>0.52</td>\n",
       "      <td>Fair</td>\n",
       "      <td>IF</td>\n",
       "      <td>2144</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>50126</th>\n",
       "      <td>0.47</td>\n",
       "      <td>Fair</td>\n",
       "      <td>IF</td>\n",
       "      <td>2211</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       carat   cut clarity  price\n",
       "47407   0.52  Fair      IF   1849\n",
       "49683   0.52  Fair      IF   2144\n",
       "50126   0.47  Fair      IF   2211"
      ]
     },
     "execution_count": 68,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "res.tail(3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 3."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 69,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.cut = df.cut.cat.reorder_categories(df.cut.cat.categories[::-1])\n",
    "df.clarity = df.clarity.cat.reorder_categories(df.clarity.cat.categories[::-1])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 70,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.cut = df.cut.cat.codes # 方法一：利用cat.codes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 71,
   "metadata": {},
   "outputs": [],
   "source": [
    "clarity_cat = df.clarity.cat.categories\n",
    "df.clarity = df.clarity.replace(dict(zip(clarity_cat, np.arange(len(clarity_cat))))) # 方法二：使用replace映射"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 72,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>carat</th>\n",
       "      <th>cut</th>\n",
       "      <th>clarity</th>\n",
       "      <th>price</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0.23</td>\n",
       "      <td>0</td>\n",
       "      <td>6</td>\n",
       "      <td>326</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0.21</td>\n",
       "      <td>1</td>\n",
       "      <td>5</td>\n",
       "      <td>326</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0.23</td>\n",
       "      <td>3</td>\n",
       "      <td>3</td>\n",
       "      <td>327</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   carat  cut  clarity  price\n",
       "0   0.23    0        6    326\n",
       "1   0.21    1        5    326\n",
       "2   0.23    3        3    327"
      ]
     },
     "execution_count": 72,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.head(3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 4."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 73,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>carat</th>\n",
       "      <th>cut</th>\n",
       "      <th>clarity</th>\n",
       "      <th>price</th>\n",
       "      <th>avg_cut</th>\n",
       "      <th>avg_qcut</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0.23</td>\n",
       "      <td>0</td>\n",
       "      <td>6</td>\n",
       "      <td>326</td>\n",
       "      <td>Low</td>\n",
       "      <td>Very Low</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0.21</td>\n",
       "      <td>1</td>\n",
       "      <td>5</td>\n",
       "      <td>326</td>\n",
       "      <td>Low</td>\n",
       "      <td>Very Low</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0.23</td>\n",
       "      <td>3</td>\n",
       "      <td>3</td>\n",
       "      <td>327</td>\n",
       "      <td>Low</td>\n",
       "      <td>Very Low</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0.29</td>\n",
       "      <td>1</td>\n",
       "      <td>4</td>\n",
       "      <td>334</td>\n",
       "      <td>Low</td>\n",
       "      <td>Very Low</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0.31</td>\n",
       "      <td>3</td>\n",
       "      <td>6</td>\n",
       "      <td>335</td>\n",
       "      <td>Low</td>\n",
       "      <td>Very Low</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   carat  cut  clarity  price avg_cut  avg_qcut\n",
       "0   0.23    0        6    326     Low  Very Low\n",
       "1   0.21    1        5    326     Low  Very Low\n",
       "2   0.23    3        3    327     Low  Very Low\n",
       "3   0.29    1        4    334     Low  Very Low\n",
       "4   0.31    3        6    335     Low  Very Low"
      ]
     },
     "execution_count": 73,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "q = [0, 0.2, 0.4, 0.6, 0.8, 1]\n",
    "point = [-np.infty, 1000, 3500, 5500, 18000, np.infty]\n",
    "avg = df.price / df.carat\n",
    "df['avg_cut'] = pd.cut(avg, bins=point, labels=['Very Low', 'Low', 'Mid', 'High', 'Very High'])\n",
    "df['avg_qcut'] = pd.qcut(avg, q=q, labels=['Very Low', 'Low', 'Mid', 'High', 'Very High'])\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 5."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 74,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['Low', 'Mid', 'High']\n",
       "Categories (3, object): ['Low' < 'Mid' < 'High']"
      ]
     },
     "execution_count": 74,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.avg_cut.unique()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 75,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Index(['Very Low', 'Low', 'Mid', 'High', 'Very High'], dtype='object')"
      ]
     },
     "execution_count": 75,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.avg_cut.cat.categories"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 76,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    Low\n",
       "1    Low\n",
       "2    Low\n",
       "Name: avg_cut, dtype: category\n",
       "Categories (3, object): ['Low' < 'Mid' < 'High']"
      ]
     },
     "execution_count": 76,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.avg_cut = df.avg_cut.cat.remove_categories(['Very Low', 'Very High'])\n",
    "df.avg_cut.head(3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 6."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 77,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    2295.0\n",
       "1    2295.0\n",
       "2    2295.0\n",
       "dtype: float64"
      ]
     },
     "execution_count": 77,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "interval_avg = pd.IntervalIndex(pd.qcut(avg, q=q))\n",
    "interval_avg.right.to_series().reset_index(drop=True).head(3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 78,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    1051.162\n",
       "1    1051.162\n",
       "2    1051.162\n",
       "dtype: float64"
      ]
     },
     "execution_count": 78,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "interval_avg.left.to_series().reset_index(drop=True).head(3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 79,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    1243.838\n",
       "1    1243.838\n",
       "2    1243.838\n",
       "dtype: float64"
      ]
     },
     "execution_count": 79,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "interval_avg.length.to_series().reset_index(drop=True).head(3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 第十章 时序数据\n",
    "### Ex1：太阳辐射数据集\n",
    "#### 1."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 80,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Radiation</th>\n",
       "      <th>Temperature</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Datetime</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>2016-09-01 00:00:08</th>\n",
       "      <td>2.58</td>\n",
       "      <td>51</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2016-09-01 00:05:10</th>\n",
       "      <td>2.83</td>\n",
       "      <td>51</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2016-09-01 00:20:06</th>\n",
       "      <td>2.16</td>\n",
       "      <td>51</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                     Radiation  Temperature\n",
       "Datetime                                   \n",
       "2016-09-01 00:00:08       2.58           51\n",
       "2016-09-01 00:05:10       2.83           51\n",
       "2016-09-01 00:20:06       2.16           51"
      ]
     },
     "execution_count": 80,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = pd.read_csv('../data/solar.csv', usecols=['Data','Time','Radiation','Temperature'])\n",
    "solar_date = df.Data.str.extract('([/|\\w]+\\s).+')[0]\n",
    "df['Data'] = pd.to_datetime(solar_date + df.Time)\n",
    "df = df.drop(columns='Time').rename(columns={'Data':'Datetime'}).set_index('Datetime').sort_index()\n",
    "df.head(3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 2.\n",
    "(a)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 81,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "DatetimeIndex(['2016-09-29 23:55:26', '2016-10-01 00:00:19',\n",
       "               '2016-11-29 19:05:02', '2016-12-01 00:00:02',\n",
       "               '2016-12-05 20:45:53', '2016-12-08 11:10:42'],\n",
       "              dtype='datetime64[ns]', name='Datetime', freq=None)"
      ]
     },
     "execution_count": 81,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "s = df.index.to_series().reset_index(drop=True).diff().dt.total_seconds()\n",
    "max_3 = s.nlargest(3).index\n",
    "df.index[max_3.union(max_3-1)]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "(b)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 82,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAX0AAAD4CAYAAAAAczaOAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8qNh9FAAAACXBIWXMAAAsTAAALEwEAmpwYAAAS8klEQVR4nO3df6yc1X3n8fcnQH6o+WEIt17WttaW4m2UVkqgFhBltT+CahyoaqRNKFVVHEplqaXdVrtS41SRUIFIzj/Jkt0NlRWcNVFasEgqvIENtUiqdKXyww6EFBw2N8QIWxDfxoY2i5rK9Lt/zHEyoff6zr0ez/X1eb+k0X2e85yZe754+DzPPXNmJlWFJKkPr1vqAUiSJsfQl6SOGPqS1BFDX5I6YuhLUkfOXeoBnMyFF15Ya9euXephSNKysn///r+tqqnZjp3Rob927Vr27du31MOQpGUlyXNzHXN6R5I6YuhLUkcMfUnqiKEvSR0x9CWpI4a+JHXE0Jekjhj6ktQRQ1+SOnJGvyNXZ5612+6ftf3g9qsnPBJJi+GVviR1xNCXpI4Y+pLUEUNfkjpi6EtSRwx9SerISKGfZEWSe5N8O8mBJO9NckGSvUm+036e3/omyaeTTCd5MsklQ4+zpfX/TpItp6soSdLsRr3Svx34SlW9E3g3cADYBjxUVeuBh9o+wAeA9e22FbgDIMkFwM3AZcClwM0nThSSpMmYN/STvA34t8CdAFX1j1X1ErAZ2NW67QKuadubgbtq4GFgRZKLgCuBvVV1tKqOAXuBTWOsRZI0j1Gu9NcBM8Dnkjye5LNJfgZYWVUvtD4vAivb9irg+aH7H2ptc7X/lCRbk+xLsm9mZmZh1UiSTmqU0D8XuAS4o6ouBv4fP5nKAaCqCqhxDKiqdlTVhqraMDU165e5S5IWaZTQPwQcqqpH2v69DE4C32/TNrSfR9rxw8Caofuvbm1ztUuSJmTe0K+qF4Hnk/xca7oCeBrYA5xYgbMFuK9t7wGub6t4LgdebtNADwIbk5zfXsDd2NokSRMy6qds/h7whSSvB54FbmBwwtid5EbgOeDa1vcB4CpgGnil9aWqjia5FXis9bulqo6OpQpJ0khGCv2qegLYMMuhK2bpW8BNczzOTmDnAsYnSRoj35ErSR0x9CWpI4a+JHXE0Jekjhj6ktQRQ1+SOmLoS1JHDH1J6oihL0kdMfQlqSOGviR1xNCXpI4Y+pLUEUNfkjpi6EtSRwx9SeqIoS9JHTH0Jakjhr4kdcTQl6SOGPqS1BFDX5I6YuhLUkcMfUnqyEihn+Rgkm8leSLJvtZ2QZK9Sb7Tfp7f2pPk00mmkzyZ5JKhx9nS+n8nyZbTU5IkaS4LudL/D1X1nqra0Pa3AQ9V1XrgobYP8AFgfbttBe6AwUkCuBm4DLgUuPnEiUKSNBmnMr2zGdjVtncB1wy131UDDwMrklwEXAnsraqjVXUM2AtsOoXfL0laoFFDv4C/SLI/ydbWtrKqXmjbLwIr2/Yq4Pmh+x5qbXO1/5QkW5PsS7JvZmZmxOFJkkZx7oj9/k1VHU7ys8DeJN8ePlhVlaTGMaCq2gHsANiwYcNYHlOSNDDSlX5VHW4/jwB/zmBO/vtt2ob280jrfhhYM3T31a1trnZJ0oTMG/pJfibJW05sAxuBvwH2ACdW4GwB7mvbe4Dr2yqey4GX2zTQg8DGJOe3F3A3tjZJ0oSMMr2zEvjzJCf6/2lVfSXJY8DuJDcCzwHXtv4PAFcB08ArwA0AVXU0ya3AY63fLVV1dGyVSJLmNW/oV9WzwLtnaf8BcMUs7QXcNMdj7QR2LnyYkqRx8B25ktQRQ1+SOmLoS1JHDH1J6oihL0kdMfQlqSOGviR1xNCXpI4Y+pLUEUNfkjpi6EtSRwx9SeqIoS9JHTH0Jakjhr4kdcTQl6SOjPrF6OrM2m33L/UQJJ0GXulLUke80tdYzPWXwcHtV094JJJOxit9SeqIoS9JHTH0Jakjhr4kdcTQl6SOjBz6Sc5J8niSL7f9dUkeSTKd5J4kr2/tb2j70+342qHH+GhrfybJlWOvRpJ0Ugu50v994MDQ/ieAT1XVO4BjwI2t/UbgWGv/VOtHkncB1wE/D2wCPpPknFMbviRpIUYK/SSrgauBz7b9AO8H7m1ddgHXtO3NbZ92/IrWfzNwd1X9qKq+B0wDl46hBknSiEa90v+vwB8C/9T23w68VFXH2/4hYFXbXgU8D9COv9z6/7h9lvv8WJKtSfYl2TczMzN6JZKkec0b+kl+GThSVfsnMB6qakdVbaiqDVNTU5P4lZLUjVE+huF9wK8kuQp4I/BW4HZgRZJz29X8auBw638YWAMcSnIu8DbgB0PtJwzfR5I0AfNe6VfVR6tqdVWtZfBC7Fer6teBrwEfbN22APe17T1tn3b8q1VVrf26trpnHbAeeHRslUiS5nUqH7j2EeDuJLcBjwN3tvY7gc8nmQaOMjhRUFVPJdkNPA0cB26qqldP4fdLkhZoQaFfVX8J/GXbfpZZVt9U1T8AH5rj/h8HPr7QQUqSxsN35EpSRwx9SeqIoS9JHTH0Jakjhr4kdcTQl6SOGPqS1BFDX5I6YuhLUkcMfUnqiKEvSR0x9CWpI4a+JHXE0Jekjhj6ktQRQ1+SOmLoS1JHDH1J6oihL0kdMfQlqSOGviR1xNCXpI4Y+pLUkXlDP8kbkzya5JtJnkryx619XZJHkkwnuSfJ61v7G9r+dDu+duixPtran0ly5WmrSpI0q1Gu9H8EvL+q3g28B9iU5HLgE8CnquodwDHgxtb/RuBYa/9U60eSdwHXAT8PbAI+k+ScMdYiSZrHvKFfAz9su+e1WwHvB+5t7buAa9r25rZPO35FkrT2u6vqR1X1PWAauHQcRUiSRjPSnH6Sc5I8ARwB9gLfBV6qquOtyyFgVdteBTwP0I6/DLx9uH2W+0iSJmCk0K+qV6vqPcBqBlfn7zxdA0qyNcm+JPtmZmZO16+RpC4taPVOVb0EfA14L7Aiybnt0GrgcNs+DKwBaMffBvxguH2W+wz/jh1VtaGqNkxNTS1keJKkeYyyemcqyYq2/Sbgl4ADDML/g63bFuC+tr2n7dOOf7WqqrVf11b3rAPWA4+OqQ5J0gjOnb8LFwG72kqb1wG7q+rLSZ4G7k5yG/A4cGfrfyfw+STTwFEGK3aoqqeS7AaeBo4DN1XVq+MtR5J0MvOGflU9CVw8S/uzzLL6pqr+AfjQHI/1ceDjCx+mJGkcfEeuJHXE0Jekjhj6ktQRQ1+SOmLoS1JHRlmyqbPY2m33L/UQJE2QV/qS1BGv9HVazfWXxMHtV094JJLAK31J6oqhL0kdMfQlqSOGviR1xNCXpI4Y+pLUEUNfkjpi6EtSRwx9SeqIoS9JHTH0Jakjhr4kdcTQl6SOGPqS1BFDX5I6YuhLUkfm/RKVJGuAu4CVQAE7qur2JBcA9wBrgYPAtVV1LEmA24GrgFeAD1fVN9pjbQE+1h76tqraNd5yNBe/FlESjPbNWceB/1JV30jyFmB/kr3Ah4GHqmp7km3ANuAjwAeA9e12GXAHcFk7SdwMbGBw8tifZE9VHRt3UTrz+Y1a0tKYd3qnql44caVeVX8PHABWAZuBE1fqu4Br2vZm4K4aeBhYkeQi4Epgb1UdbUG/F9g0zmIkSSe3oDn9JGuBi4FHgJVV9UI79CKD6R8YnBCeH7rbodY2V/trf8fWJPuS7JuZmVnI8CRJ8xg59JO8Gfgi8AdV9XfDx6qqGEzZnLKq2lFVG6pqw9TU1DgeUpLUjBT6Sc5jEPhfqKovtebvt2kb2s8jrf0wsGbo7qtb21ztkqQJmTf022qcO4EDVfXJoUN7gC1tewtw31D79Rm4HHi5TQM9CGxMcn6S84GNrU2SNCGjrN55H/AbwLeSPNHa/gjYDuxOciPwHHBtO/YAg+Wa0wyWbN4AUFVHk9wKPNb63VJVR8dRhCRpNPOGflX9HyBzHL5ilv4F3DTHY+0Edi5kgJKk8fEduZLUEUNfkjpi6EtSRwx9SeqIoS9JHTH0Jakjhr4kdcTQl6SOGPqS1BFDX5I6YuhLUkcMfUnqiKEvSR0x9CWpI4a+JHXE0Jekjhj6ktQRQ1+SOmLoS1JHDH1J6oihL0kdMfQlqSOGviR1xNCXpI6cO1+HJDuBXwaOVNUvtLYLgHuAtcBB4NqqOpYkwO3AVcArwIer6hvtPluAj7WHva2qdo23FK3ddv9SD0HSGW7e0Af+J/DfgbuG2rYBD1XV9iTb2v5HgA8A69vtMuAO4LJ2krgZ2AAUsD/Jnqo6Nq5CdHaY68R1cPvVEx6JdHaad3qnqr4OHH1N82bgxJX6LuCaofa7auBhYEWSi4Argb1VdbQF/V5g0xjGL0lagMXO6a+sqhfa9ovAyra9Cnh+qN+h1jZX+z+TZGuSfUn2zczMLHJ4kqTZnPILuVVVDKZsxqKqdlTVhqraMDU1Na6HlSSx+ND/fpu2of080toPA2uG+q1ubXO1S5ImaLGhvwfY0ra3APcNtV+fgcuBl9s00IPAxiTnJzkf2NjaJEkTNMqSzT8D/j1wYZJDDFbhbAd2J7kReA64tnV/gMFyzWkGSzZvAKiqo0luBR5r/W6pqte+OCxJOs3mDf2q+rU5Dl0xS98CbprjcXYCOxc0OknSWPmOXEnqiKEvSR0x9CWpI4a+JHXE0Jekjhj6ktSRUT5lU1pyfvqmNB6G/jLk5+ZLWiyndySpI4a+JHXE0Jekjhj6ktQRQ1+SOuLqnVNwupcRukpnfi7llBbGK31J6ohX+tI8/GtCZxNDX1okTwZajpzekaSOeKWvs5JX4dLsDP0RLHQVjYGzPLlaSj0w9M8Aho2kSTH0J8hwl7TUDH11xROvemfoDzEQJJ3tJh76STYBtwPnAJ+tqu2n63cZ4loKvpCvM9lEQz/JOcD/AH4JOAQ8lmRPVT09yXFIS2FcFyFznTw82WgUk77SvxSYrqpnAZLcDWwGDH1pRONaQjwJZ9oJaqH/Lc7GE+akQ38V8PzQ/iHgsuEOSbYCW9vuD5M8M6GxjcuFwN8u9SAmoJc6oZ9ax15nPnF6+5+CkWqd4HjG7V/NdeCMeyG3qnYAO5Z6HIuVZF9VbVjqcZxuvdQJ/dTaS53QV62vNenP3jkMrBnaX93aJEkTMOnQfwxYn2RdktcD1wF7JjwGSerWRKd3qup4kt8FHmSwZHNnVT01yTFMwLKdmlqgXuqEfmrtpU7oq9afkqpa6jFIkibEz9OXpI4Y+pLUEUN/AZKsSfK1JE8neSrJ77f2dyf56yTfSvK/krx16D4fTTKd5JkkVy7d6BcmyRuTPJrkm63WP27t65I80mq6p70gT5I3tP3pdnztkhYwopPU+butlkpy4VD/JPl0O/ZkkkuWbvQLc5Jav9Cen3+TZGeS81r7sqz1JHXe2dqeTHJvkje39mX53F20qvI24g24CLikbb8F+L/AuxisSvp3rf03gVvb9ruAbwJvANYB3wXOWeo6Rqw1wJvb9nnAI8DlwG7gutb+J8Bvt+3fAf6kbV8H3LPUNZxinRcDa4GDwIVD/a8C/ne73+XAI0tdwxhqvaodC/BnQ/+my7LWk9T51qE+nwS2te1l+dxd7M0r/QWoqheq6htt+++BAwzeZfyvga+3bnuB/9i2NwN3V9WPqup7wDSDj6I449XAD9vuee1WwPuBe1v7LuCatr257dOOX5Ekkxnt4s1VZ1U9XlUHZ7nLZuCudr+HgRVJLprQcE/JSWp9oB0r4FEG75+BZVrrSer8Oxj8BQO8icHzGZbpc3exDP1Fan8CXszgKuIpBk8cgA/xkzegzfaxE6smNMRTluScJE8ARxiczL4LvFRVx1uX4Xp+XGs7/jLw9okOeJFeW2dVPXKS7mfVv+lwrW1a5zeAr7SmZVvrXHUm+RzwIvBO4L+17sv2ubsYhv4itLnALwJ/0K4efhP4nST7GUz7/ONSjm9cqurVqnoPgyu/Sxn8j3LWeW2dSX5hiYd02sxT62eAr1fVXy3J4MZorjqr6gbgXzL4K/1Xl26ES8fQX6B2NfRF4AtV9SWAqvp2VW2sql9kMCf63db9rPjYiap6Cfga8F4Gf+KfeFPfcD0/rrUdfxvwg8mO9NQM1bnpJN3Otn/TTQBJbgamgP881G3Z1zrbv2lVvQrczU+mYZf9c3chDP0FaPN8dwIHquqTQ+0/236+DvgYgxc4YfARE9e11QHrgPUM5kzPeEmmkqxo229i8B0IBxj8D/TB1m0LcF/b3tP2ace/2uaIz2hz1Pntk9xlD3B9W9lyOfByVb1w+kd66uaqNclvAVcCv1ZV/zR0l2VZ6xx1PpPkHa0twK/wk3/nZfncXawz7lM2z3DvYzDn+a02XwjwRww+T+imtv8l4HMAVfVUkt0Mvi/gOHBTu8pYDi4CdmXwxTevA3ZX1ZeTPA3cneQ24HEGJ0Haz88nmQaOMlgFsRzMVed/Av4Q+BfAk0keqKrfAh5gsKplGngFuGGJxr0Yc9V6HHgO+Ov2+uWXquoWlm+t/6xO4H7grzJYTh0Gq+p+u/Vfrs/dRfFjGCSpI07vSFJHDH1J6oihL0kdMfQlqSOGviR1xNCXpI4Y+pLUkf8P2gGGBTzRO7IAAAAASUVORK5CYII=",
      "text/plain": [
       "<Figure size 432x288 with 1 Axes>"
      ]
     },
     "metadata": {
      "needs_background": "light"
     },
     "output_type": "display_data"
    }
   ],
   "source": [
    "res = s.mask((s>s.quantile(0.99))|(s<s.quantile(0.01)))\n",
    "_ = plt.hist(res, bins=50)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 3.\n",
    "(a)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 83,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Datetime\n",
       "2016-12-31 23:45:04    0.328574\n",
       "2016-12-31 23:50:03    0.261883\n",
       "2016-12-31 23:55:01    0.262406\n",
       "dtype: float64"
      ]
     },
     "execution_count": 83,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "res = df.Radiation.rolling('6H').corr(df.Temperature)\n",
    "res.tail(3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "(b)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 84,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Datetime\n",
       "2016-08-31 21:00:00    51.218750\n",
       "2016-09-01 03:00:00    50.033333\n",
       "2016-09-01 09:00:00    59.379310\n",
       "Freq: 6H, Name: Temperature, dtype: float64"
      ]
     },
     "execution_count": 84,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "res = df.Temperature.resample('6H', origin='03:00:00').mean()\n",
    "res.head(3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "(c)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 85,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Datetime\n",
       "2016-12-31 23:45:04    9.33\n",
       "2016-12-31 23:50:03    8.49\n",
       "2016-12-31 23:55:01    5.84\n",
       "Name: Radiation, dtype: float64"
      ]
     },
     "execution_count": 85,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "my_dt = df.index.shift(freq='-6H')\n",
    "int_loc = [df.index.get_indexer([i], method='nearest') for i in my_dt]\n",
    "int_loc = np.array(int_loc).reshape(-1)\n",
    "res = df.Radiation.iloc[int_loc]\n",
    "res.index = df.index\n",
    "res.tail(3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "以及，纸质版中提到的另一种方法 `merge_asof()`，这个方法性能更好些。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 86,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Datetime\n",
       "2016-12-31 23:45:04    9.33\n",
       "2016-12-31 23:50:03    8.49\n",
       "2016-12-31 23:55:01    5.84\n",
       "Name: Radiation, dtype: float64"
      ]
     },
     "execution_count": 86,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "target = pd.DataFrame(\n",
    "    {\n",
    "        \"Time\": df.index.shift(freq='-6H'),\n",
    "        \"Datetime\": df.index,\n",
    "    }\n",
    ")\n",
    "res = pd.merge_asof(\n",
    "    target,\n",
    "    df.reset_index().rename(columns={\"Datetime\": \"Time\"}),\n",
    "    left_on=\"Time\",\n",
    "    right_on=\"Time\",\n",
    "    direction=\"nearest\"\n",
    ").set_index(\"Datetime\").Radiation\n",
    "res.tail(3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Ex2：水果销量数据集\n",
    "#### 1.\n",
    "(a)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 87,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Month\n",
       "1    1.174998\n",
       "2    0.968890\n",
       "3    0.951351\n",
       "4    1.020797\n",
       "5    0.931061\n",
       "dtype: float64"
      ]
     },
     "execution_count": 87,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = pd.read_csv('../data/fruit.csv')\n",
    "df.Date = pd.to_datetime(df.Date)\n",
    "df_grape = df.query(\"Fruit == 'Grape'\")\n",
    "res = df_grape.groupby([np.where(df_grape.Date.dt.day<=15,'First', 'Second'),\n",
    "                        df_grape.Date.dt.month])['Sale'].mean().to_frame().unstack(0).droplevel(0,axis=1)\n",
    "res = (res.First/res.Second).rename_axis('Month')\n",
    "res.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "(b)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 88,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Date\n",
       "2019-01-31    847\n",
       "2019-02-28    774\n",
       "2019-03-31    761\n",
       "2019-04-30    648\n",
       "2019-05-31    616\n",
       "Name: Sale, dtype: int64"
      ]
     },
     "execution_count": 88,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df[df.Date.dt.is_month_end].query(\"Fruit == 'Pear'\").groupby('Date').Sale.sum().head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "(c)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 89,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Date\n",
       "2019-01-31    60.500000\n",
       "2019-02-28    59.538462\n",
       "2019-03-29    56.666667\n",
       "2019-04-30    64.800000\n",
       "2019-05-31    61.600000\n",
       "Name: Sale, dtype: float64"
      ]
     },
     "execution_count": 89,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df[df.Date.isin(pd.date_range('20190101', '20191231',freq='BM'))].query(\"Fruit == 'Pear'\").groupby('Date').Sale.mean().head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "(d)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 90,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Month\n",
       "1    65.313725\n",
       "2    54.061538\n",
       "3    59.325581\n",
       "4    65.795455\n",
       "5    57.465116\n",
       "Name: Sale, dtype: float64"
      ]
     },
     "execution_count": 90,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "target_dt = df.drop_duplicates().groupby(df.Date.drop_duplicates().dt.month)['Date'].nlargest(5).reset_index(drop=True)\n",
    "res = df.set_index('Date').loc[target_dt].reset_index().query(\"Fruit == 'Apple'\")\n",
    "res = res.groupby(res.Date.dt.month)['Sale'].mean().rename_axis('Month')\n",
    "res.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 2."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 91,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Date</th>\n",
       "      <th>January</th>\n",
       "      <th>February</th>\n",
       "      <th>March</th>\n",
       "      <th>April</th>\n",
       "      <th>May</th>\n",
       "      <th>June</th>\n",
       "      <th>July</th>\n",
       "      <th>August</th>\n",
       "      <th>September</th>\n",
       "      <th>October</th>\n",
       "      <th>November</th>\n",
       "      <th>December</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Fruit</th>\n",
       "      <th>Date</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th rowspan=\"5\" valign=\"top\">Apple</th>\n",
       "      <th>Mon</th>\n",
       "      <td>46</td>\n",
       "      <td>43</td>\n",
       "      <td>43</td>\n",
       "      <td>47</td>\n",
       "      <td>43</td>\n",
       "      <td>40</td>\n",
       "      <td>41</td>\n",
       "      <td>38</td>\n",
       "      <td>59</td>\n",
       "      <td>42</td>\n",
       "      <td>39</td>\n",
       "      <td>45</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Tue</th>\n",
       "      <td>50</td>\n",
       "      <td>40</td>\n",
       "      <td>44</td>\n",
       "      <td>52</td>\n",
       "      <td>46</td>\n",
       "      <td>39</td>\n",
       "      <td>50</td>\n",
       "      <td>42</td>\n",
       "      <td>40</td>\n",
       "      <td>57</td>\n",
       "      <td>47</td>\n",
       "      <td>47</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Wed</th>\n",
       "      <td>50</td>\n",
       "      <td>47</td>\n",
       "      <td>37</td>\n",
       "      <td>43</td>\n",
       "      <td>39</td>\n",
       "      <td>39</td>\n",
       "      <td>58</td>\n",
       "      <td>43</td>\n",
       "      <td>35</td>\n",
       "      <td>46</td>\n",
       "      <td>47</td>\n",
       "      <td>38</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Thu</th>\n",
       "      <td>45</td>\n",
       "      <td>35</td>\n",
       "      <td>31</td>\n",
       "      <td>47</td>\n",
       "      <td>58</td>\n",
       "      <td>33</td>\n",
       "      <td>52</td>\n",
       "      <td>44</td>\n",
       "      <td>36</td>\n",
       "      <td>63</td>\n",
       "      <td>37</td>\n",
       "      <td>40</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Fri</th>\n",
       "      <td>32</td>\n",
       "      <td>33</td>\n",
       "      <td>52</td>\n",
       "      <td>31</td>\n",
       "      <td>46</td>\n",
       "      <td>38</td>\n",
       "      <td>37</td>\n",
       "      <td>48</td>\n",
       "      <td>34</td>\n",
       "      <td>37</td>\n",
       "      <td>46</td>\n",
       "      <td>41</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "Date        January  February  March  April  May  June  July  August  \\\n",
       "Fruit Date                                                             \n",
       "Apple Mon        46        43     43     47   43    40    41      38   \n",
       "      Tue        50        40     44     52   46    39    50      42   \n",
       "      Wed        50        47     37     43   39    39    58      43   \n",
       "      Thu        45        35     31     47   58    33    52      44   \n",
       "      Fri        32        33     52     31   46    38    37      48   \n",
       "\n",
       "Date        September  October  November  December  \n",
       "Fruit Date                                          \n",
       "Apple Mon          59       42        39        45  \n",
       "      Tue          40       57        47        47  \n",
       "      Wed          35       46        47        38  \n",
       "      Thu          36       63        37        40  \n",
       "      Fri          34       37        46        41  "
      ]
     },
     "execution_count": 91,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "month_order = ['January','February','March','April','May','June','July','August','September','October','November','December']\n",
    "week_order = ['Mon','Tue','Wed','Thu','Fri','Sat','Sum']\n",
    "group1 = df.Date.dt.month_name().astype('category').cat.reorder_categories(month_order, ordered=True)\n",
    "group2 = df.Fruit\n",
    "group3 = df.Date.dt.dayofweek.replace(dict(zip(range(7),week_order))).astype('category').cat.reorder_categories(week_order, ordered=True)\n",
    "res = df.groupby([group1, group2,group3])['Sale'].count().to_frame().unstack(0).droplevel(0,axis=1)\n",
    "res.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 3."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 92,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "2019-01-01    189.000000\n",
       "2019-01-02    335.500000\n",
       "2019-01-03    520.333333\n",
       "2019-01-04    527.750000\n",
       "2019-01-05    527.750000\n",
       "Freq: D, dtype: float64"
      ]
     },
     "execution_count": 92,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_apple = df[(df.Fruit=='Apple')&(~df.Date.dt.dayofweek.isin([5,6]))]\n",
    "s = pd.Series(df_apple.Sale.values,index=df_apple.Date).groupby('Date').sum()\n",
    "res = s.rolling('10D').mean().reindex(pd.date_range('20190101','20191231')).fillna(method='ffill')\n",
    "res.head()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3.9.12 ('jp')",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.12"
  },
  "orig_nbformat": 2,
  "vscode": {
   "interpreter": {
    "hash": "e8055c4814a31ce50b7a636fa9c3224c9f52e79c0451e1c70c3e86e150b3589a"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
